Estimating Home Prices

Estimating home values is an important aspect of the Windfall net worth model. Property often represents a significant portion of an individual’s net worth and, due to changing home values, can be a significant driver of wealth growth.

Your task is to create a model using the attached dataset to attempt to predict home values in Denver.

The dataset contains information on approximately 3,000 single-family homes in Denver, CO. Each row contains the following information about the property that can be used to estimate the home’s value:

  • Address
  • Square footage (ft^2)
  • Lot size (ft^2)
  • Number of bedrooms
  • Number of bathrooms
  • The year the house was built
  • The year the house was last sold
  • The amount the house was last sold for

In addition, we have included the Zillow “Zestimate” (http://www.zillow.com/zestimate) of each home’s current value in the estimated_value column of the attached file. You should use these estimates to train your model.

We have reserved a set of single family home records from Denver that we will use to assess the quality of your model. The reserve dataset is in the same format as the training set.

Please implement a model using either python or R. When you are finished, send us back:

  1. Your code
  2. Instructions for running your model on our reserve dataset
  3. A short explanation of how you arrived at your answer (see below)

Here are some points you may want to cover when writing your explanation:

  • What was your methodology in assessing the data?
  • How did you decide which features were important and which ones were of no value?
  • Did you consider any strategies other than the one you used for your final model? What were they?
  • How did you train your model?
  • How did you evaluate the accuracy of your model?

In [1]:
import pandas as pd
import numpy as np
import scipy.stats
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
csv = pd.read_csv("single_family_home_values.csv", parse_dates=["last_sale_date"])

In [3]:
print csv.shape
csv.head()


(3417, 14)
Out[3]:
address city state zipcode property_type square_footage lot_size num_rooms num_bedrooms num_baths year_built last_sale_amount last_sale_date estimated_value
0 2495 S Milwaukee St Denver CO 80210 Single Family Residence 3885 7500 13 4 5 2010 350000 2010-06-07 1684147
1 2690 Niagara St Denver CO 80207 Single Family Residence 1343 7500 6 3 1 1948 31500 1978-02-01 343104
2 3425 Grove St Denver CO 80211 Single Family Residence 992 6250 5 2 2 1921 258000 2002-02-19 510436
3 4794 Claude Ct Denver CO 80216 Single Family Residence 491 2990 3 1 1 1891 0 2011-04-27 184105
4 3640 N Gaylord St Denver CO 80205 Single Family Residence 1365 4690 6 3 1 1912 52600 1980-09-01 390055

In [4]:
#scale the data
from sklearn import preprocessing
from scipy import stats

Data Preprocessing

Impune Missing Values

Typically missing values in datasets are blank but in this dataset they are filled w 0 values. We can either drop these rows/examples or fill them w/ some averaging technique.


In [5]:
# fill missing values (0's) w/ the median for the column
cols = ["square_footage", "lot_size", "num_rooms", "num_bedrooms", "num_baths", "year_built", 
     "last_sale_amount", "estimated_value"]

for col in cols:
    csv.loc[csv[col]==0, col] = csv[col].median()

In [6]:
csv.head(4)


Out[6]:
address city state zipcode property_type square_footage lot_size num_rooms num_bedrooms num_baths year_built last_sale_amount last_sale_date estimated_value
0 2495 S Milwaukee St Denver CO 80210 Single Family Residence 3885 7500 13 4 5 2010 350000 2010-06-07 1684147
1 2690 Niagara St Denver CO 80207 Single Family Residence 1343 7500 6 3 1 1948 31500 1978-02-01 343104
2 3425 Grove St Denver CO 80211 Single Family Residence 992 6250 5 2 2 1921 258000 2002-02-19 510436
3 4794 Claude Ct Denver CO 80216 Single Family Residence 491 2990 3 1 1 1891 194900 2011-04-27 184105

Location/Address Information

Although location is obviously a VERY important characteristic of home value, we CANNOT use it in its current form in this model.

If we use zip code, an ordinal value, we would have to one-hot encode it, which would add about 30 dimensions, which due to the relatively small size of the training set would add a large factor of bias to the model and would other words drastically overfit.

A more optimal solution, outside of the scope if this exercise would be to reverse geocode the address to lat/lng paris. If I can convert address to lat/lng it could be considered a continuous feature and only add 2 dimensions and get a lot of added value.

For now we have to ignore location information.

P.S.

One of my favorite uses of zip code data is to look up demographic variables based on zipcode that may not be available at the individual level otherwise...

Property Type

This feature is useless, removing it...


In [154]:
model_cols = ["square_footage", "lot_size", "num_rooms", "num_bedrooms", "num_baths", "year_built", 
              "last_sale_amount", "last_sale_date", "estimated_value"]
housing=csv[model_cols]
housing.head()


Out[154]:
square_footage lot_size num_rooms num_bedrooms num_baths year_built last_sale_amount last_sale_date estimated_value
0 3885 7500 13 4 5 2010 350000 2010-06-07 1684147
1 1343 7500 6 3 1 1948 31500 1978-02-01 343104
2 992 6250 5 2 2 1921 258000 2002-02-19 510436
3 491 2990 3 1 1 1891 194900 2011-04-27 184105
4 1365 4690 6 3 1 1912 52600 1980-09-01 390055

Transform Dates

The categorical feature last_sale_date needs to be converted into a continuous form: we will represent the Date objects as integers representing days since the epoch. We will define the epoch as 1/1/1900


In [155]:
housing["day"] = (housing.last_sale_date - pd.datetime(1900,1,1))/np.timedelta64(1,'D')
housing.drop("last_sale_date", 1, inplace=True)


/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app

In [156]:
housing.head(5)


Out[156]:
square_footage lot_size num_rooms num_bedrooms num_baths year_built last_sale_amount estimated_value day
0 3885 7500 13 4 5 2010 350000 1684147 40334
1 1343 7500 6 3 1 1948 31500 343104 28520
2 992 6250 5 2 2 1921 258000 510436 37304
3 491 2990 3 1 1 1891 194900 184105 40658
4 1365 4690 6 3 1 1912 52600 390055 29463

Number of rooms

The column num_rooms is the total number of rooms in the property including bedrooms and baths, so there is some overlap between this and the num_bedrooms and num_baths columns. We can subtract the sum of num_bedrooms and num_baths from num_rooms to increase the variance in the column. SO this column will essentially be transformed to other_room_count

Note

I think I don't understand this column, b/c example id 3393 (see below) has less total rooms than bedrooms+bathrooms. I'm just going to skip this optimization for now.


In [163]:
housing.iloc[3393]


Out[163]:
square_footage        1230
lot_size              6250
num_rooms                5
num_bedrooms             3
num_baths                3
year_built            1973
last_sale_amount    335000
estimated_value     467100
day                  41938
Name: 3393, dtype: float64

Outliers


In [10]:
# outlier_check_cols = ["square_footage", "lot_size", "num_rooms", "num_bedrooms", "num_baths", "year_built", 
#      "last_sale_amount", "estimated_value", "day"]

# for col in outlier_check_cols:
#     housing[col] = np.abs(stats.zscore(housing[col])) < 3 

housing_sans_outliers = housing[(np.abs(stats.zscore(housing)) < 3).all(axis=1)]
print housing.shape
print housing[(np.abs(stats.zscore(housing)) < 3).all(axis=1)].shape


(3417, 9)
(3265, 9)

In [11]:
housing_sans_outliers.head(5)


Out[11]:
square_footage lot_size num_rooms num_bedrooms num_baths year_built last_sale_amount estimated_value day
1 1343 7500 6 3 1 1948 31500 343104 28520
2 992 6250 5 2 2 1921 258000 510436 37304
3 491 2990 3 1 1 1891 194900 184105 40658
4 1365 4690 6 3 1 1912 52600 390055 29463
5 1112 5190 5 2 2 1925 15000 433793 29066

In [12]:
#check correlations

from pandas.tools.plotting import scatter_matrix
_=scatter_matrix(housing_sans_outliers, alpha=0.2, figsize=(17, 17), diagonal='kde')


Skewedness

loss minimizers/optimizers usually perform better when the variables have means and std deviations that are not too far apart. Doing log() or sqrt() usually helps reduce the scale of the variables so that they are closer to each other.

I log transformed certain features for which the skew was > 0.75. This will make the feature more normally distributed and this makes linear regression perform better - since linear regression is sensitive to outliers. Note that if I used a tree-based model I wouldn't need to transform the variables.


In [13]:
#log transform skewed numeric features:
numeric_feats = housing_sans_outliers.dtypes[housing_sans_outliers.dtypes != "object"].index

skewed_feats = housing_sans_outliers[numeric_feats].apply(lambda x: stats.skew(x.dropna())) #compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index

print skewed_feats

housing_sans_outliers[skewed_feats] = np.log1p(housing_sans_outliers[skewed_feats])


Index([u'square_footage', u'lot_size', u'last_sale_amount',
       u'estimated_value'],
      dtype='object')
/usr/local/lib/python2.7/site-packages/pandas/core/frame.py:2320: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[k1] = value[k2]

In [14]:
housing_sans_outliers.head()


Out[14]:
square_footage lot_size num_rooms num_bedrooms num_baths year_built last_sale_amount estimated_value day
1 7.203406 8.922792 6 3 1 1948 10.357775 12.745792 28520
2 6.900731 8.740497 5 2 2 1921 12.460719 13.143023 37304
3 6.198479 8.003363 3 1 1 1891 12.180247 12.123267 40658
4 7.219642 8.453401 6 3 1 1912 10.870490 12.874046 29463
5 7.014814 8.554682 5 2 2 1925 9.615872 12.980325 29066

In [15]:
# scaled_sq_footage = preprocessing.scale(csv[["square_footage"]])
_ = housing_sans_outliers[["square_footage"]].boxplot(return_type='dict')
_ = housing_sans_outliers[["square_footage"]].hist(bins=50)


Scaling

For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

We know that scaling is beneficial for linear models. We already sclaed in essence in the previous skew correction step, however, it didn't run for every feature, so we'll just do it here for all features for good measure.


In [105]:
final = housing_sans_outliers[["square_footage", "lot_size", "num_rooms", "num_bedrooms", "num_baths", "year_built", 
         "last_sale_amount", "day", "estimated_value"]]

# through experimentation scaling doesn't help much
# scaled = preprocessing.scale(final.values)
scaled = final.values 
features = scaled[:,0:-1]
labels = scaled[:,-1]

Training

First, let's try Regression w/ Lasso Regulation and then Regression. Then we will try a more sophisticated model: Gradient Boosting and then explore blending the models together.

Let's create a training and test set


In [106]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn import ensemble
from sklearn.metrics import r2_score

In [107]:
#test train split
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=0.25, random_state=4)

Lasso


In [111]:
lassocv = linear_model.LassoCV(alphas=[0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75])
lassocv.fit(features, labels)
lassocv.alpha_


Out[111]:
0.050000000000000003

In [190]:
lasso = linear_model.Lasso(alpha=lassocv.alpha_)
lasso.fit(X_train, Y_train)
lasso.score(X_test, Y_test)


Out[190]:
0.57609395122369689

As you can see from the above output, Lasso scores an R^2 of .576. Let's check out what features it found important and then compare w/ Ridge:


In [113]:
training_cols = ["square_footage", "lot_size", "num_rooms", "num_bedrooms", "num_baths", "year_built", 
      "last_sale_amount", "day"]
coef = pd.Series(lasso.coef_, index = training_cols)

print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " +  str(sum(coef == 0)) + " variables")

imp_coef = coef.sort_values()

imp_coef.plot(kind = "barh")
plt.title("Coefficients in the Lasso Model")


Lasso picked 5 variables and eliminated the other 3 variables
Out[113]:
<matplotlib.text.Text at 0x1124948d0>

Some of these make sense: Last Sale, Num Baths, Num Rooms are intuitively related to home price prediction. However, it is a bit incomprehensible that it decided to disgard lot size as having no importance. This can be seen as a limitation of the Lasso Regularization, let's see if Ridge can overcome this:

Ridge


In [108]:
regcv = linear_model.RidgeCV(alphas=[0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75])
regcv.fit(features, labels)
regcv.alpha_


Out[108]:
1.0

In [191]:
reg = linear_model.Ridge(alpha=regcv.alpha_)
reg.fit(X_train, Y_train)
ridge_preds = reg.predict(X_test)
reg.score(X_test, Y_test)


Out[191]:
0.66273997829631215

Nice! We have a marked improvement over Lasso!

Gradient Boosting


In [250]:
parameters = {'n_estimators':[100,300,500],'max_depth':[2, 5, 10], 'min_samples_split': [2,4,8], 
              'learning_rate': [0.1, .5]}
gbr = ensemble.GradientBoostingRegressor()
gscv = GridSearchCV(gbr, parameters)
gscv.fit(features, labels)


Out[250]:
['mean_fit_time',
 'mean_score_time',
 'mean_test_score',
 'mean_train_score',
 'param_learning_rate',
 'param_max_depth',
 'param_min_samples_split',
 'param_n_estimators',
 'params',
 'rank_test_score',
 'split0_test_score',
 'split0_train_score',
 'split1_test_score',
 'split1_train_score',
 'split2_test_score',
 'split2_train_score',
 'std_fit_time',
 'std_score_time',
 'std_test_score',
 'std_train_score']

In [251]:
gscv.best_estimator_


Out[251]:
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.1, loss='ls', max_depth=5, max_features=None,
             max_leaf_nodes=None, min_impurity_split=1e-07,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=100,
             presort='auto', random_state=None, subsample=1.0, verbose=0,
             warm_start=False)

In [252]:
params = {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 2,
          'learning_rate': 0.1, 'loss': 'ls'}
gbr = gscv.best_estimator_
gbr.fit(X_train, Y_train)
gbr.score(X_test, Y_test)


Out[252]:
0.77661911073216805

In [242]:
0.76074209487440014


Out[242]:
<matplotlib.text.Text at 0x11060be50>

In [115]:
gb_preds = gbr.predict(X_test)

Blended Model

If you have multiple models that you blend together to reduce variance and make your predictions more stable. We will do a simple linear combination of the base predictors: Ridge and GB

And there you have it, our blended model performs better (76.63%) than its parts (76.22% & 57.60%)

Now let's formalize the blended model w/ mlxtend and run Evaluation tests:


In [200]:
from mlxtend.regressor import StackingRegressor

In [253]:
meta = linear_model.LinearRegression()
blender = StackingRegressor(regressors=[reg, gbr], meta_regressor=meta)
_=blender.fit(X_train, Y_train)
y_pred = blender.predict(X_test)
blender.score(X_test, Y_test)


Out[253]:
0.76826629438944538

Evaluation

10-Fold Cross Validation to evaluate the average score of out model across 10 random samples from out dataset.


In [228]:
from sklearn.model_selection import cross_val_score

In [254]:
scores = cross_val_score(blender, features, labels, cv=10)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


Accuracy: 0.79 (+/- 0.04)

As you can see the R^2 is .78 which is very respectful.

Lets take the average diff between actual and predicted: On this training set it is about 18% error which is very respectful.


In [255]:
mean_diff = np.mean(np.abs(np.exp(Y_test)-np.exp(y_pred)))
p_mean_diff = np.mean(mean_diff/np.exp(Y_test))
print "Mean Error:\t %.0f/%0.3f%%" % (mean_diff, p_mean_diff*100)


Mean Error:	 62694/17.428%